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Abstract 

Recent technological advances coupled with large sample sets have un- 
covered many factors underlying the genetic basis of traits and the predis- 
position to complex disease, but much is left to discover. A common thread 
to most genetic investigations is familial relationships. Close relatives can 
be identified from family records, and more distant relatives can be inferred 
from large panels of genetic markers. Unfortunately these empirical esti- 
mates can be noisy, especially regarding distant relatives. We propose a 
new method for denoising genetically-inferred relationship matrices by ex- 
ploiting the underlying structure due to hierarchical groupings of correlated 
individuals. The approach, which we call Treelet Covariance Smoothing, 
employs a multiscale decomposition of covariance matrices to improve esti- 
mates of pairwise relationships. On both simulated and real data, we show 
that smoothing leads to better estimates of the relatedness amongst dis- 
tantly related individuals. We illustrate our method with a large genome- 
wide association study and estimate the "herit ability" of body mass index 
quite accurately. Traditionally heritability, defined as the fraction of the to- 
tal trait variance attributable to additive genetic effects, is estimated from 
samples of closely related individuals using random effects models. We show 
that by using smoothed relationship matrices we can estimate heritability 
using population-based samples. Finally, while our methods have been de- 
veloped for refining genetic relationship matrices and improving estimates 
of heritability, they have much broader potential application in statistics. 
Most notably, for error-in- variables random effects models and settings that 
require regularization of matrices with block or hierarchical structure. 



Key Words: covariance estimation, cryptic relatedness, genome-wide associ- 
ation, heritability, kinship. 
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Introduction. In the past decade tremendous progress has been made 
toward understanding the genetic basis of disease. This challenging endeavor 
has given rise to numerous study designs with a vast arsenal of statistical 
machinery. A common theme however is the pivotal role played by familial 
relationships. Traditionally relationships are encoded in pedigrees of known 
relatives (Thompson, 1974, 1975; Boehnke and Cox, 1997; Epstein, Duren 
and Boehnke, 2000; McPeek and Sun, 2000), but for more distantly related 
individuals, pedigree information can sometimes be erroneous or difficult to 
obtain. Relatedness can also be calculated from large panels of genetic mark- 
ers (Milligan, 2003; Albers et al, 2008; Anderson and Weir, 2007; Brown- 
ing, 2008; Browning and Browning, 2010; Purcell et al., 2007; Day- Williams 
et al., 2011; Yang et al., 2010a). While this approach has greatly expanded 
the scope of inference for relationships, empirical estimates are noisy, espe- 
cially regarding distant relatives. 

The search for a disease gene begins with finding unusual sharing of ge- 
netic material among individuals who share a trait (phenotype). Linkage 
analysis involves the study of joint inheritance of genetic material and phe- 
notypes within relatives (Hopper and Mathews, 1982; Almasy and Blangero, 
1998). Typically, these studies are restricted to relatives within a pedigree, 
but more recently the approach has been extended to samples of people 
who are more distantly related and without known pedigree structure (Day- 
Williams et al., 2011). Alternatively, genetic associations can be discovered 
from population samples, which are usually based on case-control studies. 
In these studies the sample is assumed to be unrelated but the presence 
of distant relatives (i.e. cryptic relatedness) can reduce power or generate 
spurious associations (Lander and Schork, 1994; Astle and Balding, 2009). 
Numerous methods have been proposed to deal with familial structure in 
genetic association studies (Choi, Wijsman and Weir, 2009; Bravo et al., 
2009; Thornton and McPeek, 2010; Kang et al., 2010), all of which require 
an estimate of family relationships among individuals within the study. 

Relationships are also critical for quantitative genetics. A common prob- 
lem for quantitative genetics is to estimate the fraction of variance of a con- 
tinuous trait, such as height, due to genetic variation amongst individuals in 
a population. This feature, known as heritability, delineates the relative con- 
tributions of genetic and non-genetic factors to the total phenotypic variance 
in a population. Heritability is a fundamental concept in genetic epidemiol- 
ogy and disease mapping. Using a variety of close relatives, the heritability 
of quantitative and qualitative traits can be estimated directly (Fisher, 1918; 
Devlin, Daniels and Roeder, 1997). With complex pedigrees, applying the 
same principles, heritability can be estimated using random effects models 
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(Henderson, 1950). Heritability of height, weight, IQ and many other quan- 
titative traits has been investigated for nearly a century and continues to 
generate interest (Deary et al., 2012). 

Interest in the genetic basis of disease is high because greater understand- 
ing of disease etiology will in principle lead to better treatments. Large 
population-based samples are enhancing our ability to identify DNA vari- 
ants affecting risk for disease and it has become the standard to search 
for genetic variants associated with common disease using genome-wide 
association studies (GWAS). Thousands of associations for common dis- 
eases/phenotypes have already been validated (Visscher et al., 2012). Nev- 
ertheless, even in the most successful cases, such as Inflammatory Bowel 
Disease studied in McGovern et al. (2010) and Imielinski et al. (2009), dis- 
coveries account for only a fraction of the heritability. 

Given the relatively limited discoveries thus far, a reasonable question is 
whether the heritability of a trait estimated from relatives truly does trace 
to genetic variation. Yang et al. (2010a) offer a novel approach to genetic 
analysis that shows that indeed much of it does. They propose to analyze 
population samples, rather than pedigrees, for the heritability of the trait. 
To do so they first estimate the correlation between all pairs of individuals 
in the population sample using a dense set of common genetic variants, 
such as those typically used for a GWAS. They then take this matrix and 
relate it to the covariance matrix of phenotypes for these subjects to derive 
an estimate of heritability. Thus in their application, where essentially all 
relatives are removed from the sample, heritability refers to the proportion 
of variance in the trait explained by the measured genetic markers. They 
provide a fascinating example of how this approach works in the case of 
human height and they and others applied these techniques to many other 
traits (reviewed by Visscher et al. (2012)). 

Yang et al. (2010a) 's work inspired us to consider applying a related 
approach to answer a different question. Could estimates of relatedness ob- 
tained from a population sample be improved by using smoothing techniques 
on the variance-covariance matrix? If so, population samples could be used 
to estimate heritability - in the traditional sense - without requiring close 
relatives. This approach has application to phenotypes for which extended 
pedigrees are difficult to obtain. For instance, there is controversy in the 
literature concerning the heritability of autism, which is typically estimated 
from twin studies (Hallmayer et al., 2011). Smoothing techniques could also 
be used to estimate relatedness in samples of distantly related individuals 
for many other genetic analyses. For example, a version of linkage analysis 
could be applied to distant relatives. 

imsart-aoas ver. 2012/02/28 file: relatedness_081012.tex date: August 13, 2012 



5 



We propose Treelet Covariance Smoothing - a novel method for smooth- 
ing and multiscale decomposition of covariance matrices - as a means to 
improving estimates of relationships. Treelets were first introduced in Lee 
and Nadler (2007) and Lee, Nadler and Wasserman (2008) as a multi-scale 
basis that extends wavelets to unordered data. The method is fully adap- 
tive. It returns orthonormal basis functions supported on nested clusters 
in a hierarchical tree. Unlike other hierarchical methods, the basis and the 
tree structure are computed simultaneously, and both reflect the internal 
structure of the data. 

In this work, we extend the original treelet framework for smoothing of 
one-dimensional signals to smoothing and denoising of variance-covariance 
matrices with hierarchical block structure and unstructured noise. Smooth- 
ing is achieved by a nonlinear approximation scheme in which one discards 
small elements in a multi-scale matrix decomposition. The basic idea is that 
if the data have underlying structure in the form of groupings of correlated 
variables, then we can enforce sparsity by first transforming the data into a 
treelet representation by a series of rotations of pairs of correlated variables, 
and then thresholding covariances. We refer to this new regularization ap- 
proach for covariance matrices with groupings on multiple scales as Treelet 
Covariance Smoothing (TCS). 

We apply TCS to genetically inferred relationship matrices, with the goal 
of improving estimates of pairwise relationships from large pedigrees and 
population-based samples. On both simulated and real data, we show that 
TCS leads to better estimates of the relatedness between individuals. Using 
these estimates allows us to estimate the heritability from population-based 
samples provided they include some distantly related individuals, a property 
that is almost inevitable in practice. Finally, we discuss how estimating 
heritability is simply a case of variance component estimation for an error- 
in-variables random effects model. Therefore, our method can be applied to 
a whole family of more general models of similar structure. 

Models. 

GWAS Panels. The human genome contains many millions of single 
nucleotide polymorphisms (SNPs) and other genetic variation distributed 
across the genome. In a GWAS it is now typical to measure a panel of at 
least 500,000 SNPs from each subject. SNPs typically have only two forms 
or alleles within a population. Whichever allele is less frequent is called the 
minor allele. The genotype of an individual at a SNP can then be coded as 
0, 1 or 2 depending on the number of minor alleles the individual has at 
that SNP. Alleles at SNPs in close physical proximity are often highly cor- 
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related (i.e., in linkage disequilibrium). When multiple SNPs are in linkage 
disequilibrium, we say one of these SNPs "tags", or represents, the others. 
Although estimates vary, well-designed panels of 500,000 SNPs do not tag 
all of the common SNPs in the genome and they tag very few of the SNPs 
with rare minor alleles (Yang et al., 2010a). Nevertheless, GWAS provide 
considerable information about familial relationships. 

Estimating Genetic Relationships. The relatedness between a pair of in- 
dividuals is defined by the frequency by which they share alleles identical by 
descent (IBD). Formally, two alleles are considered IBD if they descended 
from a common ancestor without an intermediate mutation. Within a pedi- 
gree relatives share very recent common ancestors, hence many alleles are 
IBD. For a more detailed exposition of genetic relationships, see Astle and 
Balding (2009). 

The quantity of interest in this investigation is the Additive Genetic Re- 
lationship which is defined as the expected proportion of alleles IBD for 
a pair of individuals. For individuals i and j we use Aij to denote this 
quantity, which is more familiar when viewed as the degree of relationship, 
where Rij = — log 2 (Ajj). For example, for siblings, first cousins, and sec- 
ond cousins, who are l'st, 3'rd and 5'th degree relatives, A is 1/2, 1/8 and 
1/32, respectively. Within a non-inbred pedigree A can be computed using 
a recursive algorithm (Thompson, 1986). For example, if individual i has 
parents k and I, then Aij = Aji = 1/2 (A,^ + Aji). 

For distantly related individuals, detailed pedigree information is not of- 
ten available; however, with GWAS data one can calculate genome-average 
relatedness directly (Astle and Balding, 2009). Even with complete informa- 
tion regarding IBD status of the chromosomes, the fraction of genetic ma- 
terial shared by relatives will differ slightly from the expectation calculated 
from the pedigree due to the stochastic nature of the meiotic process (Weir, 
Anderson and Hepler, 2006). For the purpose of genetic investigations, one 
could argue that genome-average relatedness is a truer measure of related- 
ness. For example, while two distantly related individuals are expected to 
share a small fraction of their genetic material, if they do not inherit any- 
thing from their common ancestor it seems appropriate to consider them 
unrelated. 

Under many population genetic models A^ can also be interpreted as 
a correlation coefficient. Let denote the scaled minor allele count for 
individual i at SNP k: zn~ = (z* k — 2pk)/{2pk{l — pk)) 1 ^ 2 , where z* k is the 
minor allele count and p^ is the minor allele frequency. For individuals i and 
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j at genetic variant k, it follows from our model that 



(1) 



Cov[zi k ,z jk ] = Aij. 



Exploiting this feature leads to a method of moments estimate of A from 
a panel of m genetic markers. To see this, let z& denote a column vector of 
observed scaled allele counts for all individuals at the fc'th SNP, then let 



where Z = (zj, . . . , z m ). The Genome- wide Complex Trait Analysis (GCTA) 
software from Yang et al. (2010b) computes this estimator. 

The method of moments estimator is unbiased if the population allele fre- 
quencies are known (Milligan, 2003). In practice, them's are estimated from 
the sample data. A criticism of this estimator is that some off-diagonal ele- 
ments are negative, which doesn't conform to the interpretation of A^ as a 
probability. Viewed as a correlation coefficient, however, negative quantities 
suggest the pair of individuals share fewer alleles than expected given the 
allele frequencies. Alternatively, maximum likelihood estimators of A have 
been developed (Thompson, 1975; Milligan, 2003), but these estimators are 
quite computationally intensive for GWAS panels. Hence, while method of 
moments estimators are typically less precise than maximum likelihood esti- 
mators, they are more commonly used when a large SNP panel is available. 

Estimating Heritability. By definition, the heritability of a quantitative 
trait (y) such as height is determined by the additive effect of many genes 
and genetic variants (g), each of small effect (i.e., the polygenic model). For 
individuals i = 1, . . . , n, suppose that the genetic effects are explained by J 
causal SNPs, and we can express the genetic effect as 



where Uj is the additive random effect of the jth causal variant, weighted by 
the scaled number Zij of minor alleles at this variant. Let g = (gi, . . . , g n Y be 
the vector of random effects corresponding to the additive genetic effects for 
individuals i = 1, . . . , n. For u = (m, . . . , uj) 1 and Z c = [zij], we write g = 
Z c u. Define G as the variance-covariance matrix of g. Assuming Var[u] = 
Ia^, it follows that 





k=l 



.1 



(3) 




(4) 



G = a; 




9 J 



imsart-aoas ver. 2012/02/28 file: relatedness_081012.tex date: August 13, 2012 



8 



where a 2 = Ja 2 . 

In the traditional model for quantitative traits a continuous phenotype y 
is modeled as 

(5) Vi = ^ + 9i + e i: 

where e = (ei, . . . , e n )* is the vector of residual effects, and y = (yi, . . . , ?/ n )* 
is the vector of phenotypes. In matrix notation, y = ljU+g+e. The residuals 
are assumed to be independent with variance-covariance equal to I a 2 and 
the random effects and residual error are assumed to be normally distributed. 
Consequently, 

(6) Var[y] = ^x 2 + /a 2 . 
The heritability of the phenotype y is defined as 

h 2 = ^- 

This quantity is more accurately known as the additive or narrow-sense 
heritability, in contrast to the broad-sense heritability, which includes non- 
additive genetic effects such as gene-gene interactions. Our inferences will 
be confined to narrow-sense heritability. 

If the causal SNPs (or good tag SNPs) and the phenotype were directly 
measured, then one could estimate h 2 based on Eq. 5 and the implied ran- 
dom effects model using maximum likelihood (REML) (Searle et al., 1992). 
Notationally, Z c is an n x J matrix that picks out J columns of the full SNP 
panel Z. In practice, Z c is not known. Few of the causal SNPs are known 
for any phenotype, and many causal SNPs will be missing from Z (i.e. not 
tagged by any measured SNPs). 

How then is h 2 estimated in practice? Assuming various subsets of indi- 
viduals in the sample are related with relationship matrix A (defined previ- 
ously) , heritability can be estimated without any knowledge of causal genetic 
variants that constitute g. From Eq. 1 and the polygenic model it follows 
that j c —7- A as J gets large. This inspires an alternative random effects 
model which has long been utilized in population genetics: 

(7) Var[ y ] = ^ 2 + /a 2 . 

Historically, A has been derived from known pedigree structure. However, 
provided some subsets of the individuals in the sample are related (even 
distantly), one can estimate A from genetic markers using either method of 
moments or maximum likelihood estimation techniques. This approach has 
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been applied frequently in quantitative genetics, especially in breeding stud- 
ies (Lynch and Ritland, 1999; Eding et al., 2001; Visscher et al., 2006; Hayes 
and Goddard, 2008). We conjecture that by using TCS, we can improve es- 
timates of A and obtain better estimates of heritability without knowledge 
of causal variants. 

Alternatively, if the sample is completely unrelated then substituting the 
result of Eq. 2 for 6 does not lead to an estimate of h 2 unless all of the 
causal SNPs have been recorded. Instead this approach estimates h 2 < h 2 , 
the proportion of the variance in phenotype explained by the SNP panel 
(Yang et al., 2010a). In this setting, TCS will not improve estimates of h 2 . 

Methods. 

Treelet Covariance Smoothing (TCS). The genetic relationship matrix 
A is a measure of the additive covariance structure that exists between indi- 
viduals due to a common genetic background. We estimate the relationship 
matrix using genotyped SNPs, but this estimate is usually noisy. Hence, we 
propose a method for improving upon this estimate using treelets. 

Treelets simultaneously return a hierarchical tree and orthonormal basis 
functions supported on nested clusters in the tree - both reflect the under- 
lying structure of the data. Here we extend the original treelet framework 
(Lee and Nadler, 2007; Lee, Nadler and Wasserman, 2008) for smoothing 
one-dimensional signals and functions, to a new means of smoothing and 
denoising variance-covariance matrices with hierarchical block structure and 
unstructured noise. The main idea is to first move to a different basis repre- 
sentation through a series of local transformations, and then impose sparsity 
by thresholding the transformed covariance matrix. We refer to the approach 
as Treelet Covariance Smoothing (TCS). The general set-up is as follows (see 
Appendix for reference to code and details on how to compute the treelet 
transformations) : 

Let z be a random vector in M. N with variance-covariance matrix S. In our 
context, z represents the scaled minor allele counts for a set of N individuals 
at any SNP, and the covariance S = A, the additive genetic relationship ma- 
trix of the N individuals (Eq. 1). Now at each level of the treelet algorithm, 
we have an orthonormal multiscale basis. Let {vi, . . . , Vj\r} denote the basis 
at the top of the tree (corresponding to level £ = N — 1 if using the notation 
in Appendix). We write 

N 

(8) z = y^CjVj, 

1=1 

where c, = (z, Vj) represent the orthogonal projections onto local basis vec- 
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tors on different scales. It follows that the covariance of z can be written in 
terms of a multi-scale matrix decomposition 

N N 

(9) S = Var(z) = ^ 7i,iV i (v i ) i + ^ 7ijVi(vj)', 

i=l i^j 

where 7^ = Var(cj) and 7y = Cov(cj, Cj). The first term in Eq. 9 de- 
scribes the diagonally symmetric block structure of the variance-covariance 
matrix. These blocks are organized in a hierarchical tree. The second term 
describes more complex structure, including off-diagonal rectangular blocks, 
which are also hierarchically related to each other in a multi-scale matrix 
decomposition. 

In practice, the covariance £ is unknown, and both the covariance matrix 
and the treelet basis need to be estimated from data. For relationship matri- 
ces, one can for example derive an estimate £ = A from marker data using 
method of moments or maximum likelihood methods. Denote the treelet 
basis derived from £ by {vi, . . . , vjv}, and write 

N N 

1=1 ijtj 

where 7^ = Var(cj) and 7^ = Cov(cj, Cj). 

Let T(Ti) be the covariance estimate after a treelet transformation, i.e. 
after applying a full set of N — 1 Jacobi rotations of pairs of correlated 
variables. A calculation shows that 

(10) %i = Vai(ci) = [T(S)]« and % = Chv(c h Cj ) = [T(£)]y, 

where c, = (z,Vj) and Cj = (z,Vj). This suggests 1 a smoothed estimate of 
the covariance by thresholding: 

JV N 

(11) S(A) = ^/ A [7 M ]v,(v i )' + ^/ A [7 M -]v J (v i )', 



with the thresholding function 

(12) AH = 



a when \a\ > A 
when \a\ < A, 



The special case d = (z,Si) and c, = (z,5j), where Si denotes the Kronecker delta 
function, corresponds to simple thresholding of the original covariance estimate. Here we 
consider more general groupings of correlated variables on different scales. 
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where A is a smoothing parameter. 

To summarize and in matrix short-hand notation: The smoothed genetic 
relationship matrix is given by 

(13) 1(A) = B h[T(A)] B\ 

where B = (vi, . . . , V/v) and T(A), respectively, denote the treelet basis and 
the covariance matrix at the top of the tree, and f\ corresponds to element- 
wise thresholding (Eq. 12). Note that to compute B we only need to know 
the Jacobi rotations at each level of the tree. This is because of Eq. 17 (in 
Appendix). More precisely, the treelet basis, B = • j( 2 ) • . . . • J^ -1 ), 
where the Jacobi rotation matrix is the rotation matrix at level i. The 
covariance estimate after a treelet transformation and before smoothing, 
Yf- = T(A) = B l AB. 

Choosing a smoothing parameter. The goal is to choose a threshold (A) 
that reduces noise in the estimated relationships. Traditional cross-validation 
is not an option because we cannot predict A%j without including persons 
i and j. Alternatively, we have an abundance of genetic information from 
which to estimate A. We propose a SNP subsampling procedure to estimate 
the tuning parameter. 

We begin by breaking the genome into independent training and test sets 
by randomly placing half the chromosomes into each set. To improve the 
efficiency of our estimate of A, we utilize a "blackout window" of length b 
to avoid sampling SNPs that are highly correlated. This b can be considered 
either in terms of physical location along the chromosome or the number 
of SNPs between any two SNPs in question. From the set of training chro- 
mosomes, select a relatively large sample of M independent SNPs to get a 
reliable estimate of A. We train our algorithm by smoothing A using TCS 
to get ^4(A), for all A £ A, where A is a grid of reasonable threshold values. 

Once we have A(X), for a given A, we subsample L SNP sets of size k 
from the test set of chromosomes. Here, k <C M and the SNPs within each 
of the L subsampled sets follow our defined blackout window, b. Then, for 
all I = 1, . . . , L, estimate the relationship matrix, Ai, based on the subset of 
SNPs. We then compare our smoothed relationship matrix, ^4(A), from the 
training chromosomes to each of the L non-smoothed relationship matrices, 
Ai, via a weighted risk function: 

L N 2 

( 14 ) h w = (N _ l)NL E E (4v - 4(A)) , 

V ' 1=1 i<j 
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where Wij is a weight associated with each element in A. Clearly, the optimal 
tuning parameter is A = argmin AgA lf(A). 

The reason for introducing the weighting scheme is because many subjects 
are nearly unrelated. Thus, we aim to upweight the loss function so that the 
preponderance of near-zero elements in the off-diagonal do not overwhelm 
the loss function. We suggest using the learned hierarchical tree to get the 
weights. More specifically, Wij = \[T(A)]ij\, corresponding to the absolute 
value of the correlations between the final groupings of individuals after 
N — 1 rotations (Eq. 10). Also, we set wa = because we are not interested 
in estimating inbreeding coefficients. It should be noted that this is a rather 
general weighting method. Other schemes may be more appropriate if there 
is a priori information suggesting the importance of particular relationships. 

Results. 

Simulations. To produce realistic simulations we started with the phased 
genomes (haplotypes) of individuals from the HapMap 3 database 2 , selecting 
two populations with European ancestry (CEU and TSI). Utilizing the small 
sample of available haplotypes, our first objective was to generate a large 
sample of haplotypes, representative of those that might be sampled from 
unrelated founders of a population. The challenge was to keep intact the 
realistic haplotype structure for a human population, including linkage dise- 
quilibrium (LD), without generating unusual sharing between the founders. 
To accomplish this goal we took the HapMap data on CEU and Tuscan 
samples, which were phased quite accurately into haplotypes, as the initial 
sample of chromosomes from which to generate founders. Now each founder 
haplotype was created by sampling pieces of chromosomes (or haplotypes) 
from the initial sample. To do so, the number of recombination spots per 
chromosome was determined using an overall recombination of 8 = 1CP 6 
per Mb, which is 100 times the normal rate of recombination for humans. 
The actual location of the recombination spots were then determined using 
the recombination map provided by HapMap, a procedure that successfully 
keeps intact the LD structure of the chromosome. From this pool of gen- 
erated haplotype pairs, chromosomes were randomly assigned to each of 
the 39 founders in each of 100 families. These founder chromosomes were 
then dropped through a seven generation pedigree. At each generation the 
chromosomes underwent recombination with an overall rate of 9 = 10 -8 at 
locations determined by HapMap's recombination map. Within each pedi- 
gree, the genotype information of twenty individuals was collected (colored 

2 http : //hapmap . ncbi .nlm.nih.gov/downloads/phasing/2009-02_phaseIII/ 
HapMap3_r2/ 
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yellow). We then sampled ten individuals of varying relatedness from this 
group with a random sampling strategy that favored individuals of distant 
relatedness within the pedigree. The highest pairwise relatedness within a 
family is .125, corresponding to R = 3, and the lowest is < .001. Individuals 
from different families are unrelated. Each simulation produced a total of 
1,000 individuals made up of 100 ten-member families of varying levels of 
relatedness. Finally, the entire process was repeated fifty times. 

Because we know the pedigree structure, we can compare the unsmoothed 
estimate A to A found via TCS. Here, we use the GCTA software (Yang 
et al., 2010b) to estimate A using 100,000 randomly chosen SNPs with MAF 
> .05. The optimal level of smoothing (A) is chosen via the subsampling 
scheme described previously using M = 5, 000, b = 10, k = 50, L = 50 and 
repeating everything ten times. Here, b is in terms of number of SNPs. We 
choose A by examining a plot of H (A) across a grid of A values. The optimal 
smoothing parameter is the one that minimizes the risk function, H. For 
one such simulation sample we can see from Figure 2 that A ~ .051. 

The question then becomes, does TCS improve estimates of relatedness? 
Figures 3 and 4 display boxplots comparing the root mean square error 
(RMSE) of A to A at varying levels of known pairwise relationship values. 
For a full comparison, we have included two smoothing methods: TCS, as 
previously described, as well as "simple thresholding" , wherein the elements 
of A are directly thresholded. (The latter approach is a degenerate case 
of TCS models, at level I = 0, for which the basis is the Dirac basis, i.e. 
v i = v, = 5i for i = 1, . . . , N in Eqs. 8-13.) Moving from left to right in the 
figures, the true degree of relatedness increases from R = 4, . . . , 11, to no 
relatedness. Over the entire matrix of estimates, the RMSE is .0055, .0015 
and .0019 for the unsmoothed (A), TCS (A^) and simple thresholding (A s ) 
methods respectively, demonstrating an overall advantage of TCS. As with 
many shrinkage methods, TCS introduces a slight bias that is reflected in a 
higher RMSE for closely related individuals. Consequently, TCS has a larger 
RMSE than the unsmoothed estimate for smaller values of R. Where TCS 
gains a notable advantage over the unsmoothed estimate is in differentiating 
between more distantly related individuals and noise. From Figures 3 and 5 
we can see that simple thresholding incurs a substantially larger RMSE for 
closer relationships because it thresholds too aggressively. For R = 4, 70% of 
the pairs are zeroed out, and for R > 4 virtually all pairs are zeros out. Nat- 
urally, this method has the smallest RMSE for the sample of unrelated pairs 
because thresholding zeros out all of these entries. Notably, TCS does almost 
as well in this setting. A direct comparison of RMSE does not fully reflect 
the true loss incurred in practice. In most genetic studies close relatives are 
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often recorded in pedigrees and hence estimates are not required. Alterna- 
tively, considering distant relatives to be unrelated leads to a substantial 
loss for estimating heritability and most other genetic applications. 

Heritability in Health ABC Study. Body Mass Index (BMI) is one of 
several traits measured as part of the study entitled "Whole Genome Asso- 
ciation Study of Visceral Adiposity" as part of the Health Aging and Body 
Composition (Health ABC) Study. These data are archived in the Database 
for Genotypes and Phenotypes (dbGaP) 3 . We restrict our attention to those 
1644 individuals with self-reported European ancestry. To control for con- 
founding, prior to analysis, we adjust BMI scores by regressing out age, 
gender and collection site. Our objective is to estimate heritability of BMI 
from this population sample. Published heritability estimates range from as 
low as 0.05 to as high as 0.90 (Allison et al., 1996); however, based on es- 
timates derived from known pedigrees, the heritability of BMI is estimated 
to be approximately 50-75% (Kangas-Kontio et al., 2010; Zabaneh et al., 
2009). 

From the full sample of SNPs (Illumina 1M platform) we remove those 
with missingness greater than .1% and MAF < .01. From these we select a 
subpanel of 90, 000 SNPs, chosen to be nearly evenly spaced. Based on these 
SNPs, we calculated the relationship matrix A, and find that the individuals 
are predominately unrelated. The most highly related pair appear to be third 
degree relatives, such as first cousins. And more than half of the pairs appear 
to be more distantly related than 10'th degree relatives. 

To estimate the heritability in this setting, we input the smoothed rela- 
tionship matrix in Eq. 1 1 into the REML algorithm. The required smoothing 
parameter A is selected in two ways: (i) minimizing the loss function in Eq. 14 
via the subsampling approach; and (ii) using a profile likelihood approach. 
With both techniques, we get estimates of the heritability that are very close 
to what is found in the literature. 

For a range of smoothing parameters, < A < .40, we calculate the 
smoothed relationship matrix, A\, and plug this value into the REML model 
to obtain a profile likelihood (Figure 6). Also plotted in this figure is h\, the 
heritability that maximizes REML as a function of A (or minimizes -2 times 
the log-likelihood). Without smoothing (A = 0), which is not shown in the 
plot, h? = 0.23. Smoothing the relationship matrix results in an increasing 
estimate of the heritability which stabilizes at about 70%. Further smoothing 
beyond the range displayed leads to a numerically unstable optimization 
problem and diminished likelihood. Using the profile likelihood approach, A 

3 http://www.ncbi. nlm.nih.gov/gap 
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is chosen to be the point at which REML is maximized. This method results 
in an estimate of A = .20 corresponding to h 2 = 0.71. Smoothing using our 
SNP subsampling scheme results in A = .18 and h 2 = 0.72. 

For comparison, we have repeated the above experiments with an orthog- 
onal basis computed by principal component analysis (PCA) in lieu of a 
treelet basis. Such an approach does not improve the estimates of family 
relationships or heritability. When noise is present, PCA is unable to un- 
cover the underlying sparse structure of the relationship matrix. In fact, the 
results with PCA are identical to those without smoothing (with the profile 
likelihood peaking when the tuning parameter is set to 0). 

Another trait that was measured in this study is Abdomen Visceral Fat 
Density (AVFD). As was the case with BMI, we restricted our attention to 
individuals of European descent and regressed out age, sex and collection 
site. According to the literature, the heritability of AVFD should be between 
45-70% (Katzmarzyk, Perusse and Bouchard, 1999). According to Figure 6, 
one can see that using the smoothing parameter based on our subsampling 
scheme (A = .18) we get h 2 = .29. On the other hand, exploiting the profile 
likelihood plot results in A = .09 and h 2 = .36. When no smoothing was used 
(not shown in figure), A = .11. Thus, both methods for choosing the smooth- 
ing parameter used in TCS resulted in estimates of the heritability that are 
closer to what's established in the literature than without smoothing. 

It is notable that h 2 for both traits increased towards the established 
estimate of heritability regardless of how we estimate the optimal smoothing 
parameter, because only a small fraction of the genome was sampled by the 
SNP panel. Thus, our results underscore the fact that the quantitative trait 
model given in Eq. 5 does not require measurement of the causal SNPs that 
constitute Eq. 3. What is required is a good estimate of A based on relatives. 

Our analysis of BMI and AVFD illustrates the difference between esti- 
mates of heritability in the traditional sense and estimates of h 2 , the her- 
itability attributable to the SNPs in the panel. From Eqs. 6 and 7 it is 
clear that heritability derived from the classic quantitative traits model can 
distinguish between variance explained by relatives and variance explained 
by causal SNPs only if either (i) all causal SNPs are excluded, or (ii) all 
relatives are excluded. Because a large number of undiscovered SNPs scat- 
tered across the genome are likely to be causal, and large samples invariably 
contain distantly related individuals, some ambiguity will always be present. 

Clearly the 90, 000 SNPs in our panel do not explain a substantial fraction 
of the variation in BMI and yet we obtain an accurate estimate of heritability 
using TCS. The increase in estimated heritability of BMI from 23% to 72% 
suggests that smoothing improves the estimate of A and that a substantial 
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fraction of the correlation in BMI in our sample is due to genetic related- 
ness. In a similar study with a larger population sample Yang et al. (2011) 
estimated hr s of BMI at 17% when using the full SNP panel, but excluding 
all detectable relatives. Assuming relatives were successfully removed, they 
conclude that approximately 17% of the variability in BMI is explained by 
common variants included or tagged by the SNP panel. 

Discussion. Recently, there has been an upsurge of papers on sparse 
covariance matrix estimation; see Bickel and Levina (2008), Cai and Liu 
(2011) and references within. Most of this research concerns the problem of 
estimating population covariance matrices from samples of multivariate data 
in the "large p-small n" regime using banding or thresholding techniques 
in the original coordinate system. Our setting is slightly different with a 
more complex data structure: We want to improve estimates of a large co- 
variance matrix (A) in which we expect a hierarchical block structure due 
to clustering of distantly related individuals. A noisy estimate of covariance 
is obtained from a large sample of SNPs, each of which contains very little 
information. This matrix is interpreted as the additive genetic relationship 
matrix and it can be used to infer degree of relationship between pairs of 
individuals. 

We propose a new method, which we call treelet covariance smooth- 
ing (TCS), for regularizing real symmetric matrices with hierarchical block 
structure and unstructured noise. We show how a subsampling strategy ap- 
plied to SNPs can be used to chose the tuning parameter for the smoothing 
procedure. For simulated data, we show that TCS does indeed improve es- 
timates of family relationships. As an application we show how TCS can 
be used to estimate heritability of quantitative traits from a genome-wide 
sample of SNPs by smoothing relationships estimated from those SNPs. We 
then apply TCS to the problem of estimating the heritability of body mass 
index (BMI) and abdomen visceral fat density (AVFD) in the Health ABC 
data set. In particular, BMI heritability is usually quoted to be at least 
0.50, but an estimate based on a noisy estimate of A yields a much lower 
value of 0.23. By denoising the estimated relationship matrix with treelets, 
we increase the estimated heritability of BMI from 0.23 to 0.72. AVFD her- 
itability analysis produces similar results. Thus, a careful examination of 
heritability estimates using more distant relatives demonstrates that one 
may substantially improve relationship estimates using TCS. 

Other covariance regularization schemes exist in the literature, but sys- 
tematic comparison is beyond the scope of this work. Direct application 
of regularization methods for a sample covariance matrix (Z C Z^.) is some- 
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times further complicated if we do not have direct access to the multivariate 
data matrix Z c . Cai and Liu (2011), for example, describe a state-of-the-art 
adaptive thresholding method for heteroscedastic problems that requires an 
estimate of the variability of the entries of a sample covariance matrix. To 
our knowledge, TCS is the only principled approach to regularization of gen- 
eral similarity matrices with block structure on multiple scales. In addition, 
the computed basis vectors themselves contain information of the internal 
structure of the data - a topic that we will explore in a separate paper with 
applications to complex extended pedigrees. One can also easily modify the 
TCS algorithm so that positive semi-definiteness is always guaranteed. 

Our results are relevant to a recent area of burgeoning interest in genet- 
ics, namely, the estimation of heritability from population samples (Yang 
et al., 2010a). However our purpose is to estimate heritability, as tradition- 
ally defined, rather than to determine the fraction of variation explained 
by measured SNPs. We expect that the TCS-refined genetic relationships 
will find wide application to other problems in genetics, such as population- 
based linkage analysis (Day- Williams et al., 2011), along with linear mixed 
models for testing association (Kang et al., 2010). 

Furthermore, TCS can be applied to a whole family of mixed effects 
"error-in- variables" models of the form 

(15) y = WP + Zu + e, 

where y G M n is a vector of response variables; f3 G M p is a vector of fixed 
effects; u£R 5 represents random effects; and e G M ra is a vector of residual 
errors. In the general case, we assume that there are c random effects, where 
each random effect originates from a specific distribution with zero mean 
and unknown variance. In vector-matrix notation: 

/U! \ 

u= I and Z = (Zi, . . . , Z c ) 

where u, is a x 1 vector whose elements are the levels of the ith random 
factor, q = q\ + . . . + q c , and Z{ is an n x q-i matrix of regressors for the ith 
random factor. Assuming E(u) = E(e) = and 



u 




" D 


" 


e 







E 



Var 

where D = diag(ofl 9l , . . . , a^.I gc ), yields E[y] = W(3 and 

c 

Var[y] = ZDZ 1 + E = ^ afZiZl + E 

i=l 
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where the variance components af,...,a^ and E are unknown and to be 
estimated. Now consider an error-in-variables scenario in which the matrix 
W of regressors of fixed effects is known, but we only have noisy estimates of 
some or all of the positive semi-definite (psd) matrices ZiZj associated with 
the random effects. If these matrices have block structure and the noise is 
unstructured, then one could potentially improve estimates of variance com- 
ponents by first applying TCS. In our application, for example, we looked 
at a special case where we first estimate the psd matrix Z C Z\ in an additive 
polygenic model using marker-based data, and then use a denoised estimate 
of Z c Z l c to estimate the variance components, a 2 g and a 2 in a random effects 
model where D = o~gI and E = a\l. 

In summary, we have introduced a new method, called Treelet Covariance 
Smoothing (TCS), that regularizes a relationship matrix estimated from a 
large panel of genetic markers. In the context of a GWAS study a huge 
number of SNPs are measured, each of which provides information about 
the relationship between individuals in the sample. We proposed a SNP 
subsampling procedure that exploits this rich source of information to choose 
a tuning parameter for the algorithm. We illustrated one instance of the 
utility of such estimates by substituting the resulting smoothed relationship 
matrix into a random effects model to estimate the heritability of body mass 
index. While others have used genetically inferred estimates of relatedness 
from samples of close relatives to estimate heritability, we believe this is 
the first time such estimates have been applied to population-based samples 
when the goal is to estimate heritability in the traditional sense. 
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Fig 1. Pedigree of a single family used for simulations. Genomes were dropped through 
the entire pedigree and ten individuals were sampled from the twenty possible highlighted 
individuals. Individuals 35-39 are unrelated to everyone else in the pedigree. 
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Fig 2. Cross-validation plot showing the weighted risk function at varying levels of the 
thresholding parameter, A. The optimal X is the point where the H(X) (CV Score) is min- 
imized. 
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Fig 3. Boxplots of RMSE for unsmoothed (A) along with smoothed using TCS (Af) and 
simple thresholding (As) at increasing degrees of relatedness (R = 4, 5,6; see header). Here, 
TCS is better than simple thresholding as the latter method thresholds too aggressively. 
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Fig 4. Boxplots of RMSE for unsmoothed (A) along with smoothed using TCS (A-Q and 
simple thresholding (A s ) at increasing degrees of relatedness (R = 7, 8, 9 — 11). Also in- 
cluded is the comparison of RMSE values for unrelated pairs (R = Inf) and average RMSE 
for the entire relationship matrix (R — Total). We see that both thresholding methods re- 
move noise, but TCS works better than simple thresholding overall. 



imsart-aoas ver. 2012/02/28 file: relatedness_081012.tex date: August 13, 2012 



2G 



00 

d 



CD 

d 



d 



CM 

d 



o 
d 



n 



ATS 

3 



ATS 
4 



ATS 

5 



ATS 
6 



ATS 

7 



ATS 

8 



ATS 

9 



ATS 

10 



ATS 
11 



Fig 5. Barplots of the percentage of relationships that are equal to for no smoothing (A), 
smoothing using TCS (T) and simple thresholding (S). The three cases are compared at 
increasing degrees of relatedness (R = 3, ... ,11). Any value below e = 1CP 5 is considered 
to be 0. 
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Fig 6. Estimating heritability in the Health ABC data set. Solid curve is the estimated heri- 
tability at increasing values of the smoothing parameter A. The dashed curve is oc —2log{C), 
where, C is the maximum profile likelihood obtained from the REML algorithm. The solid 
vertical line is the optimally chosen threshold value using our subsampling scheme. The 
dashed vertical line represents the optimally chosen threshold value when minimizing the 
likelihood profile. A: For BMI, h 2 = .72 when using subsampling to choose an optimal 
smoothing parameter (A = .18). Similarly, h 2 — .71 when using the profile likelihood plot 
(A = .20). With no smoothing (A = 0), h 2 — .23. This is not shown on the plot. B: For 
AVFD, h 2 = .29 when using our subsampling approach to choose an optimal smoothing 
parameter. However, h' 2 = .36 when using the profile likelihood plot (A = .09). These are 
compared to h 2 = .11, the heritability when there is no smoothing (not shown). 
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APPENDIX: THE TREELET ALGORITHM 



Hierarchical clustering algorithms offer an easily interpretable description 
of data in terms of a dendrogram and some measure of similarity between 
the observations. So called agglomerative hierarchical methods start at the 
bottom of the tree and merge, at each level, the two groups with the high- 
est inter-group similarity into one larger cluster. The novelty of the treelet 
algorithm proposed in Lee, Nadler and Wasserman (2008) is the construc- 
tion of not only clusters or groupings of variables, but also functions on the 
data. More specifically, treelets construct a multi-scale orthonormal basis on 
a hierarchical tree. As in standard multi-resolution analysis (MRA) (Mallat, 
1999), the algorithm returns a set of "scaling functions" defined on nested 
subspaces Vq D V\ D . . . D Vl, and a set of orthogonal "detail functions" 
defined on residual spaces {We}f =1 where Vt © We = Vg-\. These functions 
are well- localized in space; in fact, they are supported on nested clusters in 
a hierarchical tree. 

The treelet algorithm, as well as its implementation is available in R on 
CRAN as the treelet library. Here we summarize the original published al- 
gorithm; see Lee, Nadler and Wasserman (2008) for a theoretical analysis. 
At each level of the tree, we group together the most similar variables and 
replace them by a coarse-grained "sum variable" and a residual "difference 
variable". In this paper, we measure similarity by Pearson's correlation co- 
efficient but other choices are also possible. The new variables are then com- 
puted by a local PCA or Jacobi rotation in two dimensions. Unlike Jacobi's 
classical method for eigendecompositions (Golub and Van Loan, 1996), dif- 
ference variables (i.e. 2nd local principal components) are stored, and only 
sum variables (i.e. 1st local principal components) are processed at higher 
levels of the tree. Hence, the multi-resolution analysis. The details of the 
complete treelet algorithm are as follows: 

• At level 1 = (the "bottom" of the tree), each observation or "sig- 
nal" z is represented by the original variables = (so.i, . . • , so,tv)*> 
where so t k = £&. Associate to these coordinates, the Dirac basis Bq = 
(<fo,i) ^0,2 > • • • , 0o,.zv) where Bq is the N x N identity matrix and 0o,j 
are unit vectors. Compute estimates of the covariance and similarity 



matrices, and where = , ^ == . Initialize the index 



set of "sum variables" , Sq = {1, 2, . . . , N}. 
• Repeat for £ = 1, . . . , L, where L < N — 1. 

1. Find the two most similar sum variables according to the 
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similarity matrix 1 \ Let 

(16) {a t , fa} 



2. 



arg max M>- 



where i < j, and maximization is only over pairs of sum variables 
that belong to the set Sg-±. As in standard wavelet analysis, 
difference variables (defined in step 3) are not processed. 
Perform a local PCA on this pair. Find a Jacobi rotation 
matrix 

- 1 ... o ••• ••• 



(17) 



J(ae,p e ,e e ) = 



















••• 1 

where c = cos (0g) and s = sin (9g), that decorrelates z a and zp; 
more specifically, find a rotation angle 9g such that \9g\ < 7r/4 

and EJS = Ej£ = 0, where = J'S^^J. This transfer- 

«p pa ' 

mation corresponds to a change of basis Bg = Bg_\J, and new 
coordinates = J*z^ -1 ). 
Update the similarity matrix AfW accordingly. 
3. Multi-resolution analysis. For ease of notation, assume that 



"[0 

> after the Jacobi rotation, where the indices a and 
/3 denote the first and second principal components, respectively. 

Define the sum and difference variables at level £ as sg = z a ' 

(I) 

and dg = Zp . Similarly, define the scaling and detail functions cf)g 
and ipg as columns ag and fa of the basis matrix Bg. Remove the 
difference variable from the set of sum variables, Sg = Sg~\ \{fa}. 
At level £, we have the orthonormal treelet decomposition 



N-g 



(18) 



y2 sg,i0g,i +^2djipi. 



i=i 



i=i 



where the new set of scaling vectors {(j>gi}f^i is the union of the 
vector (f>g and the scaling vectors {<f)g-i t j}j^ a ,i3 from the previous 
level, and the new coarse-grained sum variables {sg : i}^J[ e are the 
projections of the original data onto these vectors. As in stan- 
dard multi- resolution analysis, the first sum is the coarse-grained 
representation of the signal, while the second sum captures the 
residuals at different scales. 
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So,l So, 2 SO, 3 So, 4 So, 5 



Fig 7. (Left) A toy example of a hierarchical tree for data of dimension N = 5. At 
I — 0, the signal is represented by the original N variables. At each successive level 
£ — 1,2,..., N—l the two most similar sum variables are combined and replaced by the sum 
and difference variables (se,de) corresponding to the first and second local principal com- 
ponents. (Hight ) Signal representation at different levels. The s- and d-coordinates 
represent projections along scaling and detail functions in a multi-scale treelet decomposi- 
tion. Each such representation is associated with an orthogonal basis in H N that captures 
the local eigenstructure of the data. (This figure has been adopted from Lee, Nadler and 
Wasserman (2008).) 



Fig. 7 (left) shows an example of a treelet construction for a "signal" of 
length N = 5, with the data representations at the different levels of the 
tree shown on the right. The s-components (projections in the main principal 
directions) represent coarse-grained "sums" . We associate these variables to 
the nodes in the cluster tree. Similarly, the (i-components (projections in the 
orthogonal directions) represent "differences" between node representations 
at two consecutive levels in the tree. For example, in the figure, diipi = 

{SO, 100,1 + 50,200,2) - 51^1,1 • 

The output of the algorithm can be compactly summarized in terms of an 
ordered set of rotations and pairs of indices {6i, ag, f3e}e=ii where L < N — 1 
is the height of the associated hierarchical tree. 

In terms of computational complexity, a naive implementation of the 
treelet algorithm with an exhaustive nearest neighbor search corresponds 
to a total of 0(N 3 ) operations. By using a fast nearest neighbor search that 
is linear in time, one can further reduce the computational cost to 0(N 2 ). 



imsart-aoas ver. 2012/02/28 file: relatedness_081012.tex date: August 13, 2012 



